Part of Speech Tagging for Text Clustering in Swedish

Author

  • Magnus Rosell
Abstract

Text clustering could be very useful both as an intermediate step in a large natural language processing system and as a tool in its own right. The result of a clustering algorithm depends on the text representation that is used. Swedish has a fairly rich morphology and a large number of homographs. This possibly leads to problems in Information Retrieval in general. We investigate the impact on text clustering of adding the part-of-speech tag to all words in the common term-by-document matrix. The experiments are carried out on a few different text sets. None of them gives any evidence that part-of-speech tags improve results. However, representing texts using only nouns and proper names gives a smaller representation without worsening results. We also investigate the effect of lemmatization and the use of a stoplist, both of which improve results significantly in some cases.
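The representations compared in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the function names (`to_terms`, `term_document_matrix`), the toy documents, and the tag labels (`NOUN`, `PROPN`, `VERB`, `ADV`, `ADJ`) are assumptions made for illustration, and the input is assumed to be already tagged. The sketch builds a term-by-document count matrix from plain words, from words with the part-of-speech tag appended, and from nouns and proper names only.

```python
from collections import Counter

# Hypothetical pre-tagged documents: lists of (word, tag) pairs.
# "var" is a Swedish homograph ("where" / "was"); the tags are illustrative.
docs = [
    [("var", "ADV"), ("var", "VERB"), ("boken", "NOUN")],
    [("boken", "NOUN"), ("var", "VERB"), ("bra", "ADJ")],
]

def to_terms(doc, use_tags=False, keep_tags=None):
    """Turn a tagged document into a list of terms.

    use_tags:  append the tag to each word, so homographs become distinct terms.
    keep_tags: if given, keep only words whose tag is in this set.
    """
    terms = []
    for word, tag in doc:
        if keep_tags is not None and tag not in keep_tags:
            continue
        word = word.lower()
        terms.append(f"{word}_{tag}" if use_tags else word)
    return terms

def term_document_matrix(docs, **kwargs):
    """Return (vocabulary, matrix) with one row per term, one column per document."""
    counts = [Counter(to_terms(doc, **kwargs)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix

plain_vocab, plain = term_document_matrix(docs)
tagged_vocab, tagged = term_document_matrix(docs, use_tags=True)
noun_vocab, nouns = term_document_matrix(docs, keep_tags={"NOUN", "PROPN"})

print(plain_vocab)   # words only: both readings of "var" share one row
print(tagged_vocab)  # word_TAG terms: "var" is split by tag
print(noun_vocab)    # nouns and proper names only: a smaller vocabulary
```

In this toy example, tagging splits the single row for "var" into two rows, while the noun-only filter shrinks the vocabulary. Whether either change helps clustering is an empirical question, which is what the paper's experiments address.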


Similar resources

A Part-of-Speech Tagging System for the Persian Language

Abstract: Part-of-speech (POS) tagging is an essential step for many models and methods in other areas of natural language processing, such as machine translation, spell checking, text-to-speech, and automatic speech recognition. So far, highly accurate POS taggers have been created for many languages. In this paper, we focus on POS tagging in the Persian language. Because of problems in Persian POS t...


Porting a Stochastic Part-of-Speech Tagger to Swedish

Abstract: The Xerox Part-of-Speech Tagger (XPOST) claims to be practical. One aspect of practicality as defined here is reusability. Thus it is meant to be easy to port XPOST to a new language. To test this, XPOST was ported to Swedish. This port is described and evaluated. In previous work on part-of-speech tagging, a practical part-of-speech tagger was defined as one with the following set o...


The Open Source Tagger HunPoS for Swedish

HunPoS, a freely available open source part-of-speech tagger—a reimplementation of one of the best performing taggers, TnT—is applied to Swedish and evaluated when the tagger is trained on various sizes of training data. The tagger’s accuracy is compared to other data-driven taggers for Swedish. The results show that the tagging performance of HunPoS is as accurate as TnT and can be used effici...


Finite state segmentation of discourse into clauses

The paper presents background and motivation for a processing model that segments discourse into units that are simple, non-nested clauses, prior to the recognition of clause-internal phrasal constituents, and experimental results in support of this model. One set of results is derived from a statistical reanalysis of the Swedish empirical data in [18] concerning the linguistic structure of maj...


Part-of-Speech Tagging Using the Brill Method

Part-of-speech tagging is the process of associating each word in a text with its part-of-speech category and possibly a set of morphosyntactic features. This information is represented by part-of-speech tags. This paper describes an implementation of a part-of-speech tagger for Swedish based on the Brill method. The basic idea is to apply a set of rules to an initial annotation achieved using...




Publication date: 2009